Improvements
Since this was my first ever Kaggle competition and machine learning project, I was familiar with and could implement only the basics detailed above. There is a lot more I could have done:
- Feature Engineering: I did only minimal feature engineering, namely feature selection with VarianceThreshold. Instead, if I had studied each feature carefully with some EDA, measured feature importance with Mutual Information, and generated more useful features (by clustering the most important features, applying PCA to the features with the highest linear correlation with the target, or deriving simple numerical combinations of the most important features), my model could have improved significantly. A sketch of this idea is given after this list.
- Hyperparameter Tuning: I performed hyperparameter tuning with GridSearchCV. This was suboptimal: GridSearchCV can only scan a limited set of candidate values, tests every combination in a brute-force manner that consumes a lot of time, and is cumbersome to apply to a nested estimator like StackingClassifier. A framework like Optuna, which uses a principled search algorithm to explore a wide range of parameter combinations and can be applied to a StackingClassifier directly, could have given me a more optimal estimator (see the sketch after this list).
- MLOps: I built a pipeline for encoding, imputing, scaling and transformation, then detected and removed outliers, then rebalanced the classes through oversampling, then removed redundant features, and finally applied the estimator. I applied all of these steps individually to both the train and test data. But given a completely new data point, how do I apply all of these steps in a single go to classify it? This could have been done by writing a class with a method for each step and a final method that performs everything at once; given any new sample, that class could then be used to classify it directly (a sketch follows this list).
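For the first point, here is a minimal sketch of how feature importance with mutual information and some simple feature generation might look, assuming the data is already numeric and imputed. The derived column names and the choice of the top five features are purely illustrative, not what I actually used.

```python
# Sketch: rank features by mutual information, then derive a few new
# columns from the most informative ones. All names here are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

def rank_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank features by mutual information with the target (highest first)."""
    mi = mutual_info_classif(X, y, random_state=0)
    return pd.Series(mi, index=X.columns).sort_values(ascending=False)

def add_engineered_features(X: pd.DataFrame, top_features: list) -> pd.DataFrame:
    """Derive new columns from the most informative features: a product,
    a ratio, and the first principal component of the whole top set."""
    X = X.copy()
    a, b = top_features[0], top_features[1]
    X[f"{a}_times_{b}"] = X[a] * X[b]
    X[f"{a}_over_{b}"] = X[a] / (X[b] + 1e-9)
    X["top_pca_1"] = PCA(n_components=1).fit_transform(X[top_features]).ravel()
    return X

# Usage (X_train / y_train being the competition data after basic cleaning):
# mi_scores = rank_features(X_train, y_train)
# X_train = add_engineered_features(X_train, list(mi_scores.index[:5]))
```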
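For the second point, this is a sketch of how Optuna could tune a StackingClassifier directly. The base estimators, parameter ranges and the toy dataset are stand-in assumptions; in the project the preprocessed competition data would be used instead.

```python
# Sketch: tuning a StackingClassifier with Optuna instead of GridSearchCV.
import optuna
from sklearn.datasets import load_breast_cancer  # stand-in for the competition data
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def objective(trial: optuna.Trial) -> float:
    # Build a StackingClassifier whose hyperparameters come from the trial.
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(
                n_estimators=trial.suggest_int("rf_n_estimators", 100, 500),
                max_depth=trial.suggest_int("rf_max_depth", 3, 12),
                random_state=0)),
            ("svc", SVC(
                C=trial.suggest_float("svc_C", 1e-2, 1e2, log=True),
                probability=True)),
        ],
        final_estimator=LogisticRegression(
            C=trial.suggest_float("lr_C", 1e-2, 1e2, log=True), max_iter=1000),
        cv=5,
    )
    # Score the whole stack with cross-validation; Optuna maximizes this.
    return cross_val_score(stack, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```

Unlike a grid, Optuna's sampler focuses later trials on the promising regions of the search space, so far fewer fits are wasted on clearly bad combinations.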
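For the third point, here is a sketch of what such a wrapper class could look like, assuming IsolationForest for the outlier detection and imblearn's SMOTE for the oversampling; the concrete transformers and estimator are placeholders for the ones I actually used. Note that outlier removal and oversampling change the number of rows, so they run only at fit time.

```python
# Sketch: one wrapper that fits every step once and can then classify a
# brand-new sample in a single call. Components are illustrative stand-ins.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class EndToEndClassifier:
    def __init__(self):
        # Imputing + scaling; encoding/transformation steps would slot in here too.
        self.preprocessor = Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ])
        self.selector = VarianceThreshold(threshold=0.01)
        self.estimator = RandomForestClassifier(random_state=0)

    def fit(self, X, y):
        X = self.preprocessor.fit_transform(X)
        # Outlier removal and oversampling only apply to training data,
        # because they add or drop rows.
        mask = IsolationForest(random_state=0).fit_predict(X) == 1
        X, y = X[mask], np.asarray(y)[mask]
        X, y = SMOTE(random_state=0).fit_resample(X, y)
        X = self.selector.fit_transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Apply the already-fitted transformers, then classify in one go.
        X = self.preprocessor.transform(X)
        X = self.selector.transform(X)
        return self.estimator.predict(X)

# Usage:
# clf = EndToEndClassifier().fit(X_train, y_train)
# predictions = clf.predict(new_samples)
```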